For testing, use \(\alpha=0.05\) unless otherwise specified. Datasets can be found in our shared drive. Please complete all three exercises.

Exercise 1

Low birth weight (LBW) infants have lower infant mortality in groups in which LBW is most frequent. For example, in 1991, US infants born to smokers had higher risks of both LBW and infant mortality than infants born to nonsmokers. However, among LBW infants, infant mortality was lower for infants born to smokers (relative rate = 0.79). There are competing theories regarding this so-called “paradox”. See Hernández-Díaz et al. (2006).

This exercise examines the differences between using the linear regression model versus the generalized linear model (GLM) in which the distribution of the response variable is taken to be normal. You will find the data in the file practical2_lowbwt.sav. The codebook for these data can be found below.

column name description
ID    Identification code
LOW   Low Birth Weight Baby (1=Yes, i.e. under 2500 g; 0=No)
AGE   Mother’s age in years
LWT   Mother’s weight in pounds at last menstrual period
RACE  Race (1=White, 2=Black, 3=Other)
SMOKE Smoked during pregnancy (1=Yes, 0=No)
PTL   History of premature labour (number of times)
HT    History of hypertension (1=Yes, 0=No)
UI    Presence of uterine irritability (1=Yes, 0=No)
FTV   Number of physician visits during the 1st trimester
BWT   Baby’s birth weight in grams
  1. Perform a linear regression of the variable BWT (birth weight) on the explanatory variables SMOKE and LWT (weight of the mother during last menstrual period), and interpret the coefficient associated with the variable LWT.

  2. Compare these results with the results of an analysis in GzLM (SPSS users) or glm(...) (R users) in which you define the distribution of the response variable as normal (let the software create confidence intervals for the regression coefficients). Do the results agree? Where do they differ and why? In your answer, please comment on the similarities and differences in the diagnostics when comparing the two models.

For R Users!

Use m <- lm(...) to fit the model and confint(m) for the confidence intervals. If you have two models, m1 and m2, you may compare them by using modelsummary(list(m1, m2), stars=TRUE). You will need to install and load the package modelsummary.
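For orientation, here is a minimal sketch of both fits. It uses the birthwt data frame from the MASS package as a stand-in for practical2_lowbwt.sav (the same Hosmer–Lemeshow low birth weight study, with lowercase variable names); swap in the practical's file for your actual answer.

```r
# Stand-in for practical2_lowbwt.sav: the same low birth weight study
# ships with MASS as `birthwt` (variable names are lowercase there)
library(MASS)

# Linear regression of birth weight on smoking and mother's weight
m1 <- lm(bwt ~ smoke + lwt, data = birthwt)
summary(m1)
confint(m1)          # 95% CIs based on the t distribution

# The same mean model as a GLM with a normal response
m2 <- glm(bwt ~ smoke + lwt, family = gaussian, data = birthwt)
summary(m2)
confint(m2)          # compare also confint.default(m2)
```

The point estimates of the two fits coincide; the confidence intervals can differ slightly because the GLM intervals rely on the normal approximation rather than the t distribution.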

  3. What are the degrees of freedom associated with the two deviance indicators? Explain why these indicators have those values.

  4. BONUS QUESTION: Calculate the deviance for the following: the null model, the residual deviance for the first model above, and the residual deviance for the second model.

For R Users!

To answer these questions, you need the residuals of the models, which you can obtain by typing residuals(m), where m is your model (the output of the lm or glm function).
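As a sketch of what those deviance calculations look like (again with MASS::birthwt standing in for the practical's file): for a normal model, the residual deviance is just the residual sum of squares.

```r
library(MASS)  # for the birthwt stand-in data

m <- lm(bwt ~ smoke + lwt, data = birthwt)
rss <- sum(residuals(m)^2)          # residual sum of squares

# The gaussian GLM reports this same quantity as its residual deviance
g <- glm(bwt ~ smoke + lwt, family = gaussian, data = birthwt)
c(rss = rss, deviance = deviance(g))

# The null deviance is the deviance of the intercept-only model
g0 <- glm(bwt ~ 1, family = gaussian, data = birthwt)
c(null = deviance(g0), reported = g$null.deviance)
```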

Remember:

  5. Make a cross tabulation (aka contingency table) of the variables LOW and SMOKE; print the percentages of LOW within the SMOKE-groups in your cross tabulation as well. What do you conclude from the table about the association between smoking and low birth weight?

BONUS: If you use xtabs in R and the formula ~ SMOKE + LOW, you can run a chi-square test by typing summary(xtabs(...)). Try to run this and interpret the results!
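A sketch of that bonus, with MASS::birthwt standing in for the practical's file (variables are lowercase there):

```r
library(MASS)  # birthwt stand-in data

tab <- xtabs(~ smoke + low, data = birthwt)
tab
summary(tab)   # includes a chi-square test of independence for the table
```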

For SPSS Users

Analyze > Descriptive statistics > Crosstabs

For R Users

You could either use the xtabs command for this, or use the functions from the janitor package, namely tabyl, adorn_percentages, adorn_pct_formatting and adorn_title.
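One possible janitor pipeline (a sketch, with MASS::birthwt standing in for the .sav file; install janitor first if needed):

```r
library(MASS)     # birthwt stand-in data
library(janitor)  # install.packages("janitor") if missing

birthwt |>
  tabyl(smoke, low) |>              # counts of low within smoke groups
  adorn_percentages("row") |>       # row percentages: LOW within SMOKE
  adorn_pct_formatting(digits = 1) |>
  adorn_title()                     # add variable names to the table
```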

  6. Perform a logistic regression of the variable LOW on the explanatory variable SMOKE by building a generalized linear model, using the logit link function. Take LOW as the dependent and SMOKE as predictor. What is the interpretation of the coefficient of SMOKE? Compare your results with the results of the two models you created earlier in this exercise. Pay attention to similarities and differences in the diagnostics of the two types of models. Which (if any) of the three models is most informative or useful?
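A sketch of the logistic fit (MASS::birthwt once more standing in for the practical's file):

```r
library(MASS)  # birthwt stand-in data

m_logit <- glm(low ~ smoke, family = binomial(link = "logit"), data = birthwt)
summary(m_logit)

# The smoke coefficient is a log odds ratio; exponentiate for the OR scale
exp(coef(m_logit))
exp(confint(m_logit))
```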

Exercise 2

This exercise further explores the Dupuytren data we already encountered in the lectures. It builds and compares various models. You will need to use the file practical2_dupuytren.sav.

  1. Is the probability of having Dupuytren related to Age? Fit a binomial model with logistic link function and compare your results with those shown in the recorded lecture. Compare the model with and without a correction for over-dispersion. To correct for over-dispersion, estimate the scale in two ways: first using Pearson’s chi-square, \(\hat{\phi}_P=\frac{\chi^2}{N-p}\), and second using the deviance, \(\hat{\phi}_D=\frac{D}{N-p}\) (see slide 103).

For R Users

To estimate the scale, calculate the sum of squares of Pearson’s residuals, which you can find as follows: residuals(mdl, type="pearson").

You then print out the model using the summary and the dispersion argument as follows: summary(mdl, dispersion=scale). Alternatively, you can calculate the model with over-dispersion directly using glm with family=quasibinomial.
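The steps above can be sketched as follows. The real Dupuytren counts live in practical2_dupuytren.sav, so this sketch fabricates grouped binomial data of the same shape purely for illustration; substitute the real variables in your answer.

```r
# Illustrative grouped binomial data (NOT the real Dupuytren counts)
set.seed(1)
age   <- seq(40, 90, by = 5)
total <- rep(100, length(age))
cases <- rbinom(length(age), size = total, prob = plogis(-6 + 0.08 * age))

mdl <- glm(cbind(cases, total - cases) ~ age, family = binomial)

# Scale estimates: Pearson chi-square / (N - p) and deviance / (N - p)
phi_pearson  <- sum(residuals(mdl, type = "pearson")^2) / df.residual(mdl)
phi_deviance <- deviance(mdl) / df.residual(mdl)

summary(mdl, dispersion = phi_pearson)  # SEs inflated by sqrt(phi)

# Shortcut: quasibinomial applies the Pearson-based scale automatically
mdl_q <- glm(cbind(cases, total - cases) ~ age, family = quasibinomial)
summary(mdl_q)
```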

To calculate the predicted values, use the predict function and add the new data you want to predict in the function call. Make sure you use the correct type for the prediction (response) and pass the dispersion you are interested in. When evaluating the 95% confidence interval, use the formula \(\hat{Y}\pm1.96*SE\).
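A sketch of the prediction step, on the same kind of fabricated data (not the real Dupuytren counts); swap in your fitted Dupuytren model and the .sav variable names.

```r
# Illustrative grouped binomial data (NOT the real Dupuytren counts)
set.seed(1)
age   <- seq(40, 90, by = 5)
total <- rep(100, length(age))
cases <- rbinom(length(age), size = total, prob = plogis(-6 + 0.08 * age))
mdl <- glm(cbind(cases, total - cases) ~ age, family = binomial)

# Predicted probability at age 78 with its standard error; the dispersion
# argument of predict() rescales the SE when correcting over-dispersion
pr <- predict(mdl, newdata = data.frame(age = 78),
              type = "response", se.fit = TRUE)
ci <- pr$fit + c(lower = -1.96, estimate = 0, upper = 1.96) * pr$se.fit
ci
```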

  2. Predict the 95% CI for the probability of having Dupuytren at the age of 78, according to the binomial model. Reproduce the graph on slide 92 (see below) and then compare your estimate to the results that can be read off the graph.

For SPSS Users

Add a fictitious case with Age = 78 and Total = 100 to the dataset. Run the analysis again and let SPSS save the predicted value of the mean response and its lower and upper bounds.

  3. Fit a Poisson model to the data, with and without a correction for over-dispersion (use Pearson’s chi-square). Mind the offset! Compare the results with those from the lecture. Which model do you prefer, and why?
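A sketch of the Poisson fit with the offset, again on fabricated data of the right shape (substitute the real Dupuytren counts):

```r
# Illustrative data (NOT the real Dupuytren counts)
set.seed(1)
age   <- seq(40, 90, by = 5)
total <- rep(100, length(age))
cases <- rbinom(length(age), size = total, prob = plogis(-6 + 0.08 * age))

# log(total) as offset, so the linear predictor models the log rate
mp <- glm(cases ~ age + offset(log(total)), family = poisson)
summary(mp)

# Over-dispersion correction via Pearson chi-square, as for the binomial model
phi <- sum(residuals(mp, type = "pearson")^2) / df.residual(mp)
summary(mp, dispersion = phi)
# or equivalently:
mp_q <- glm(cases ~ age + offset(log(total)), family = quasipoisson)
```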

Exercise 3: Simpson’s Paradox

In 1973, UC Berkeley was nearly sued for sex discrimination, all due to an interesting statistical phenomenon called Simpson’s paradox. In this exercise, we are going to investigate the dataset of admissions to UC Berkeley in 1973, which you can find in practical2_UCBadmit.sav. The data contain the number of applicants and the number accepted to UC Berkeley in 1973. You will then judge whether the data support the claim that UC Berkeley may have been guilty of sex discrimination.

  1. Compare the acceptance rate among male and female applicants to UC Berkeley in 1973. What do you conclude?

  2. Create a table, comparing the acceptance rates among male and female applicants in each department separately. What do you conclude? Are the numbers consistent with those in your answer to the previous question? Explain.

For SPSS Users

You can use the following commands: Analyze > Descriptive statistics > Crosstabs

  3. Perform a binomial regression of admissions out of applications on the explanatory variable applicant.gender by building a generalized linear model, using the logit link function. What is the interpretation of the coefficient of applicant.gender? Show that the goodness-of-fit measures suggest that the model may be inadequate.
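A sketch of this model using R's built-in UCBAdmissions table, which holds the same 1973 Berkeley counts (for the six largest departments) as the practical's file; variable names in practical2_UCBadmit.sav will differ.

```r
# Built-in 1973 Berkeley admissions counts (six largest departments)
ucb <- as.data.frame(UCBAdmissions)   # columns: Admit, Gender, Dept, Freq

admitted <- xtabs(Freq ~ Gender + Dept, data = ucb, subset = Admit == "Admitted")
rejected <- xtabs(Freq ~ Gender + Dept, data = ucb, subset = Admit == "Rejected")

# One row per Gender x Dept with admitted / rejected counts
dat <- data.frame(
  gender   = rep(rownames(admitted), times = ncol(admitted)),
  dept     = rep(colnames(admitted), each  = nrow(admitted)),
  admitted = as.vector(admitted),
  rejected = as.vector(rejected)
)

# Binomial regression of admissions out of applications on gender
m_g <- glm(cbind(admitted, rejected) ~ gender, family = binomial, data = dat)
summary(m_g)   # compare the residual deviance with its degrees of freedom
```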

  4. The previous model did not correct for over-dispersion. Compare it to a model that does correct for over-dispersion, estimating the scale in two ways: using Pearson’s chi-square and using the deviance.

  5. To the model above, add the fixed effects of the departments by creating the appropriate dummy variables. What does this model suggest about the possibility of sex discrimination in acceptance rates?
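A sketch of the department-adjusted model on the same built-in UCBAdmissions data (in R, a factor covariate generates the dummy variables for you; in SPSS you would create them by hand):

```r
ucb <- as.data.frame(UCBAdmissions)   # built-in 1973 Berkeley counts

admitted <- xtabs(Freq ~ Gender + Dept, data = ucb, subset = Admit == "Admitted")
rejected <- xtabs(Freq ~ Gender + Dept, data = ucb, subset = Admit == "Rejected")

dat <- data.frame(
  gender   = rep(rownames(admitted), times = ncol(admitted)),
  dept     = rep(colnames(admitted), each  = nrow(admitted)),
  admitted = as.vector(admitted),
  rejected = as.vector(rejected)
)

# Adding dept as a factor adds its dummy variables to the model
m_gd <- glm(cbind(admitted, rejected) ~ gender + dept,
            family = binomial, data = dat)
summary(m_gd)  # note how the gender coefficient changes once dept is held fixed
```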